Datascope: Mining Biological Sequences
Abstract
Much of the work in the data-mining community defines data mining as a collection of techniques for extracting knowledge out of large databases. This definition is ambiguous because the “knowledge” extracted from databases varies dramatically across systems. In the spirit of intelligent systems, I suggest we define data mining roughly as a collection of techniques that produce representations that efficiently support a class of queries about the original database. I argue that the ability to answer a wide range of queries efficiently and accurately is a plausible measure of “understanding” the data. By queries, I do not mean traditional database queries; the exact definition of the query language depends on the application. This informal definition suggests the need to develop formal criteria that measure an intelligent system’s effectiveness in terms of its ability to efficiently and accurately construct local representations, integrate these representations with a global architecture, and answer complex questions about the domain.

Mining biological databases is a particularly important challenge for this philosophy. The ultimate goal of mining biological sequences is to develop a comprehensive computational framework that can take a genomic DNA sequence and automatically produce a detailed annotation of the organism: one that describes all the genes, how the genes are regulated by regulatory elements, and how the genes function together to produce higher levels of function and behavior. Research in the intelligent bioinformatics community focuses on learning algorithms, data-mining tools, and systems that will let us transform biological sequences, observations, and knowledge into structured and meaningful information that scientists can query, visualize, and understand.
I call this approach to mining biological sequences Datascope, because it focuses on providing biologists with the capability of probing genomic data with various degrees of detail and a wide range of viewpoints. The approach produces systems that help discover the fundamental connections between genetic sequences and functions of living organisms.

Computational-genomics problems
Similar resources
Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are the most important portion of biological sequences and play crucial roles in the corresponding sequence’s fold and functionality. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes based on the contexts’ structures. Much research has been performed to critically determine the structure and function of repetitiv...
High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences
High fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discovering all patterns whose utility meets a user-specified minimum high utility threshold. It comprises extracting patterns which are highly accessed in mobile web service sequences. Different from the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...
Algorithm for Sequences Data Mining in Biology
Biological data mining has become an important research area in recent years due to the exponential increase in the amount of biological data. In this paper, we discuss the applications of data mining methods in biological sequence analysis. The algorithms we analyze fall into the support vector machine type. Key-Words: Data mining, biology, vector machine, biological sequences.
Mining frequent biological sequences based on bitmap without candidate sequence generation
Biological sequences carry a lot of important genetic information of organisms. Furthermore, there is an inheritance law related to protein function and structure which is useful for applications such as disease prediction. Frequent sequence mining is a core technique for association rule discovery, but existing algorithms suffer from low efficiency or poor error rate because biological sequenc...
A Brief Survey On Data mining For Biological and Environmental Problems
In the past, many researchers have used data mining techniques in many areas. Large amounts of data have been collected from scientific domains such as geosciences, astronomy, meteorology, geology and biological sciences. Data mining techniques and tools are also used by researchers in biological and environmental problems. In biological science, data mining used in sequence alignment is based on the ...